For this assignment, since it is a very popular topic, we wanted to see whether we could reach a conclusion about what sort of music is most popular, knowing that music is nowadays streamed on many different platforms.
We suspected that factors such as danceability make songs more popular, but we cannot be sure which other characteristics affect the number of streams of a song. So we decided to explore, through clustering, the Spotify data we downloaded from Kaggle (https://www.kaggle.com/datasets/nelgiriyewithana/top-spotify-songs-2023).
The data covers the most famous songs of 2023 across different platforms. It contains several indices describing the musical attributes of each song, such as bpm, energy, acousticness and danceability.
The first step is to get a broad idea of what the data looks like, so that we can carry out the computations that will make our analysis easier.
spotify.df <- read.csv(file = "spotify-2023.csv", header = TRUE, sep = ",", dec = ".")
head(spotify.df)
And a bit more information about each column.
str(spotify.df)
## 'data.frame': 953 obs. of 24 variables:
## $ track_name : chr "Seven (feat. Latto) (Explicit Ver.)" "LALA" "vampire" "Cruel Summer" ...
## $ artist.s._name : chr "Latto, Jung Kook" "Myke Towers" "Olivia Rodrigo" "Taylor Swift" ...
## $ artist_count : int 2 1 1 1 1 2 2 1 1 2 ...
## $ released_year : int 2023 2023 2023 2019 2023 2023 2023 2023 2023 2023 ...
## $ released_month : int 7 3 6 8 5 6 3 7 5 3 ...
## $ released_day : int 14 23 30 23 18 1 16 7 15 17 ...
## $ in_spotify_playlists: int 553 1474 1397 7858 3133 2186 3090 714 1096 2953 ...
## $ in_spotify_charts : int 147 48 113 100 50 91 50 43 83 44 ...
## $ streams : num 1.41e+08 1.34e+08 1.40e+08 8.01e+08 3.03e+08 ...
## $ in_apple_playlists : int 43 48 94 116 84 67 34 25 60 49 ...
## $ in_apple_charts : int 263 126 207 207 133 213 222 89 210 110 ...
## $ in_deezer_playlists : int 45 58 91 125 87 88 43 30 48 66 ...
## $ in_deezer_charts : int 10 14 14 12 15 17 13 13 11 13 ...
## $ in_shazam_charts : int 826 382 949 548 425 946 418 194 953 339 ...
## $ bpm : int 125 92 138 170 144 141 148 100 130 170 ...
## $ key : chr "B" "C#" "F" "A" ...
## $ mode : chr "Major" "Major" "Major" "Major" ...
## $ danceability_. : int 80 71 51 55 65 92 67 67 85 81 ...
## $ valence_. : int 89 61 32 58 23 66 83 26 22 56 ...
## $ energy_. : int 83 74 53 72 80 58 76 71 62 48 ...
## $ acousticness_. : int 31 7 17 11 14 19 48 37 12 21 ...
## $ instrumentalness_. : int 0 0 0 0 63 0 0 0 0 0 ...
## $ liveness_. : int 8 10 31 11 11 8 8 11 28 8 ...
## $ speechiness_. : int 4 4 6 15 6 24 3 4 9 33 ...
At first glance, we can see that the key and mode columns have type character, which might be a problem for further computations, so we have to be careful with them.
dim(spotify.df)
## [1] 953 24
We have 953 rows and 24 columns. Briefly, this dataset is mainly a comparison of the popularity of each song across different platforms, such as Spotify or Apple Music. In addition, there is some specific information about each song, like danceability or release date. After this broad overview, we now explain each column in a more concrete way:
After our first glance at the data, we can now come up with some questions about this dataset:
To achieve this, our first step will be to analyse the data and then try to reduce its dimension, so that we can find the main attributes that influence the number of streams.
All these questions are interesting, but before we can actually answer them, the very first thing we need to do is clean the data, since errors in it could affect our conclusions.
Firstly, we checked the variables we wanted to use and saw that some of them are too specific to be useful for our analysis, so we are going to remove them. We start by removing the variable “released_day”.
spotify.df <- spotify.df[-c(6)]  # drop "released_day" (column 6)
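As a side note, dropping a column by its numeric index is fragile: if the column order ever changed, a different variable would be silently removed. A name-based alternative achieves the same thing (a small toy sketch, not run on the actual dataset):

```r
# Toy data frame standing in for spotify.df; removal by name, not by position
df <- data.frame(track_name = "LALA", released_day = 23, bpm = 92)
df <- df[, names(df) != "released_day"]
names(df)  # "track_name" "bpm"
```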
summary(spotify.df)
## track_name artist.s._name artist_count released_year
## Length:953 Length:953 Min. :1.000 Min. :1930
## Class :character Class :character 1st Qu.:1.000 1st Qu.:2020
## Mode :character Mode :character Median :1.000 Median :2022
## Mean :1.556 Mean :2018
## 3rd Qu.:2.000 3rd Qu.:2022
## Max. :8.000 Max. :2023
##
## released_month in_spotify_playlists in_spotify_charts streams
## Min. : 1.000 Min. : 31 Min. : 0.00 Min. :2.762e+03
## 1st Qu.: 3.000 1st Qu.: 875 1st Qu.: 0.00 1st Qu.:1.417e+08
## Median : 6.000 Median : 2224 Median : 3.00 Median :2.902e+08
## Mean : 6.034 Mean : 5200 Mean : 12.01 Mean :5.138e+08
## 3rd Qu.: 9.000 3rd Qu.: 5542 3rd Qu.: 16.00 3rd Qu.:6.738e+08
## Max. :12.000 Max. :52898 Max. :147.00 Max. :3.704e+09
##
## in_apple_playlists in_apple_charts in_deezer_playlists in_deezer_charts
## Min. : 0.00 Min. : 0.00 Min. : 0.0 Min. : 0.000
## 1st Qu.: 13.00 1st Qu.: 7.00 1st Qu.: 13.0 1st Qu.: 0.000
## Median : 34.00 Median : 38.00 Median : 44.0 Median : 0.000
## Mean : 67.81 Mean : 51.91 Mean : 385.2 Mean : 2.666
## 3rd Qu.: 88.00 3rd Qu.: 87.00 3rd Qu.: 164.0 3rd Qu.: 2.000
## Max. :672.00 Max. :275.00 Max. :12367.0 Max. :58.000
##
## in_shazam_charts bpm key mode
## Min. : 0 Min. : 65.0 Length:953 Length:953
## 1st Qu.: 0 1st Qu.:100.0 Class :character Class :character
## Median : 2 Median :121.0 Mode :character Mode :character
## Mean : 60 Mean :122.5
## 3rd Qu.: 37 3rd Qu.:140.0
## Max. :1451 Max. :206.0
## NA's :50
## danceability_. valence_. energy_. acousticness_.
## Min. :23.00 Min. : 4.00 Min. : 9.00 Min. : 0.00
## 1st Qu.:57.00 1st Qu.:32.00 1st Qu.:53.00 1st Qu.: 6.00
## Median :69.00 Median :51.00 Median :66.00 Median :18.00
## Mean :66.97 Mean :51.43 Mean :64.28 Mean :27.06
## 3rd Qu.:78.00 3rd Qu.:70.00 3rd Qu.:77.00 3rd Qu.:43.00
## Max. :96.00 Max. :97.00 Max. :97.00 Max. :97.00
##
## instrumentalness_. liveness_. speechiness_.
## Min. : 0.000 Min. : 3.00 Min. : 2.00
## 1st Qu.: 0.000 1st Qu.:10.00 1st Qu.: 4.00
## Median : 0.000 Median :12.00 Median : 6.00
## Mean : 1.581 Mean :18.21 Mean :10.13
## 3rd Qu.: 0.000 3rd Qu.:24.00 3rd Qu.:11.00
## Max. :91.000 Max. :97.00 Max. :64.00
##
After deleting the unnecessary column, we now check whether there are any missing values in our data.
barplot(colMeans(is.na(spotify.df)), las=2)
In this plot we can see that there are NAs in the “in_shazam_charts” column. Since Shazam is a song-recognition service and does not have the same playlist structure as the other platforms, we decided it was better to remove this column and focus on the streaming platforms where users can make playlists.
spotify.df <- subset(spotify.df, select = -c(13))#remove shazam
barplot(colMeans(is.na(spotify.df)), las=2)
We then changed the type of the year and month variables, as it does not make sense to treat them as plain numbers.
Thinking about the nature of the data, ‘released_year’ and ‘released_month’ represent time periods, not continuous quantities. Treating them as factors prevents them from being included inappropriately in calculations, such as the mean or standard deviation, that assume numerical data.
spotify.df$released_year = as.factor(spotify.df$released_year)
spotify.df$released_month = as.factor(spotify.df$released_month)
str(spotify.df)
## 'data.frame': 953 obs. of 22 variables:
## $ track_name : chr "Seven (feat. Latto) (Explicit Ver.)" "LALA" "vampire" "Cruel Summer" ...
## $ artist.s._name : chr "Latto, Jung Kook" "Myke Towers" "Olivia Rodrigo" "Taylor Swift" ...
## $ artist_count : int 2 1 1 1 1 2 2 1 1 2 ...
## $ released_year : Factor w/ 50 levels "1930","1942",..: 50 50 50 46 50 50 50 50 50 50 ...
## $ released_month : Factor w/ 12 levels "1","2","3","4",..: 7 3 6 8 5 6 3 7 5 3 ...
## $ in_spotify_playlists: int 553 1474 1397 7858 3133 2186 3090 714 1096 2953 ...
## $ in_spotify_charts : int 147 48 113 100 50 91 50 43 83 44 ...
## $ streams : num 1.41e+08 1.34e+08 1.40e+08 8.01e+08 3.03e+08 ...
## $ in_apple_playlists : int 43 48 94 116 84 67 34 25 60 49 ...
## $ in_apple_charts : int 263 126 207 207 133 213 222 89 210 110 ...
## $ in_deezer_playlists : int 45 58 91 125 87 88 43 30 48 66 ...
## $ in_deezer_charts : int 10 14 14 12 15 17 13 13 11 13 ...
## $ bpm : int 125 92 138 170 144 141 148 100 130 170 ...
## $ key : chr "B" "C#" "F" "A" ...
## $ mode : chr "Major" "Major" "Major" "Major" ...
## $ danceability_. : int 80 71 51 55 65 92 67 67 85 81 ...
## $ valence_. : int 89 61 32 58 23 66 83 26 22 56 ...
## $ energy_. : int 83 74 53 72 80 58 76 71 62 48 ...
## $ acousticness_. : int 31 7 17 11 14 19 48 37 12 21 ...
## $ instrumentalness_. : int 0 0 0 0 63 0 0 0 0 0 ...
## $ liveness_. : int 8 10 31 11 11 8 8 11 28 8 ...
## $ speechiness_. : int 4 4 6 15 6 24 3 4 9 33 ...
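Converting these columns to factors changes how R summarises them; a quick illustration (toy values, not the dataset):

```r
yr <- factor(c(2019, 2023, 2023))
# mean(yr) now warns and returns NA instead of a misleading average;
# counts per level are the meaningful summary for a factor:
table(yr)
```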
A higher number of streams and a greater presence in playlists indicate higher popularity, whereas in the chart rankings a smaller number means higher popularity. So we transformed the chart variables as follows.
transform_rank <- function(rank) {
  ifelse(!is.na(rank), 1 / (rank + 1), NA)
}
# Worst (largest) rank across the three platforms, taken before transforming
max_rank <- max(c(spotify.df$in_spotify_charts,
                  spotify.df$in_apple_charts,
                  spotify.df$in_deezer_charts), na.rm = TRUE)
# Map rank r to 1/(r + 1), so that a smaller rank gives a larger score
spotify.df$in_spotify_charts <- transform_rank(spotify.df$in_spotify_charts)
spotify.df$in_apple_charts <- transform_rank(spotify.df$in_apple_charts)
spotify.df$in_deezer_charts <- transform_rank(spotify.df$in_deezer_charts)
# Any missing rank is scored just below the worst observed rank
spotify.df$in_spotify_charts[is.na(spotify.df$in_spotify_charts)] <- 1 / (max_rank + 2)
spotify.df$in_apple_charts[is.na(spotify.df$in_apple_charts)] <- 1 / (max_rank + 2)
spotify.df$in_deezer_charts[is.na(spotify.df$in_deezer_charts)] <- 1 / (max_rank + 2)
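A quick sanity check of the transform on a few hypothetical ranks (rank 0 being the top spot):

```r
transform_rank <- function(rank) ifelse(!is.na(rank), 1 / (rank + 1), NA)
transform_rank(c(0, 1, 9, NA))  # 1.0, 0.5, 0.1, NA — smaller rank, larger score
```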
After that, we checked if there are any outliers.
library(outliers)
boxplot(subset(spotify.df, select = -c(1, 2, 14, 15)))$out  # drop non-numeric columns: track_name, artist name, key, mode
## [1] 8.000000e+00 4.000000e+00 5.000000e+00 4.000000e+00 4.000000e+00
## [6] 4.000000e+00 5.000000e+00 5.000000e+00 5.000000e+00 4.000000e+00
## ... (output truncated: 1,038 outlier values in total) ...
## [1036] 3.600000e+01 3.200000e+01 2.700000e+01
We can see that the values of ‘streams’ are extremely large. We might have to scale them later so that they do not bias the results.
Before dealing with that, let's dig deeper into the data by looking at its distributions.
variables <- c("danceability_.", "valence_.", "energy_.", "acousticness_.",
               "instrumentalness_.", "liveness_.", "speechiness_.")
par(mfrow = c(3, 3))
for (variable in variables) {
  hist(spotify.df[[variable]], main = paste("Histogram of", variable), xlab = variable)
}
music = subset(spotify.df, select = c("danceability_.","valence_.","energy_.","acousticness_.","instrumentalness_.","liveness_.","speechiness_."))
boxplot(music)$out
## [1] 25 24 23 9 14 15 16 63 17 2 19 1 18 2 3 1 51 8 9 1 4 1 5 25 46
## [26] 1 51 1 10 1 90 47 4 35 2 12 1 3 1 13 1 41 4 13 24 23 6 1 4 18
## [51] 2 3 1 20 5 1 9 1 1 30 63 15 2 24 4 6 8 91 27 72 42 9 10 1 6
## [76] 14 5 2 1 3 1 44 2 11 1 61 63 5 1 83 18 22 33 1 56 48 83 53 50 64
## [101] 58 91 80 80 47 53 50 61 56 64 92 52 72 72 46 77 66 60 49 46 66 97 60 49 48
## [126] 90 67 58 60 47 49 48 51 63 54 51 48 63 24 33 28 25 28 29 33 28 33 34 25 24
## [151] 22 25 28 33 49 23 22 24 64 30 39 25 36 42 26 30 32 35 31 38 26 29 23 33 28
## [176] 26 32 27 32 34 24 29 29 24 46 23 38 36 39 31 22 23 36 27 29 32 35 30 36 22
## [201] 23 28 31 23 24 26 33 24 28 24 25 31 38 22 37 22 30 34 40 40 24 22 28 24 40
## [226] 37 39 29 23 34 27 41 44 27 31 43 35 33 38 36 46 36 29 31 31 32 38 31 35 25
## [251] 36 34 39 26 25 23 34 23 25 45 23 45 40 26 25 23 46 36 44 32 39 59 32 27
The scales of the music-related attributes are relatively similar. Looking more closely at the plots, attributes such as “danceability”, “valence” and “energy” are well distributed. On the other hand, “instrumentalness_.”, which measures how likely a track is to contain no vocals, is almost always zero because nearly every popular song has lyrics, so we decided not to include it in further analysis. We kept “acousticness_.”, “liveness_.” and “speechiness_.”: their distributions are not ideal, but they are not extremely skewed either.
Now, let us examine whether there are common factors among the music attributes. We conducted a factor analysis, expecting it to give us a better understanding of the general patterns shared by popular songs.
library(corrplot)
corr_matrix <- cor(spotify.df[c("danceability_.", "valence_.", "energy_.", "acousticness_.",
                                "instrumentalness_.", "liveness_.", "speechiness_.")],
                   use = "complete.obs")
corrplot(corr_matrix, method = "color", type = "upper",
         order = "hclust", addCoef.col = "black",
         tl.col = "black", tl.srt = 45,
         diag = FALSE)
It appears that valence and danceability are mildly correlated, whereas energy and acousticness show a strong negative correlation.
Let's examine this structure further with a factor analysis.
variables_fa <- spotify.df[, c("danceability_.", "valence_.", "energy_.", "acousticness_.",
                               "instrumentalness_.", "liveness_.", "speechiness_.")]
fp <- factanal(variables_fa, factors = 2, rotation="varimax", scores="regression")
fp
##
## Call:
## factanal(x = variables_fa, factors = 2, scores = "regression", rotation = "varimax")
##
## Uniquenesses:
## danceability_. valence_. energy_. acousticness_.
## 0.791 0.005 0.568 0.005
## instrumentalness_. liveness_. speechiness_.
## 0.981 0.997 0.998
##
## Loadings:
## Factor1 Factor2
## danceability_. -0.218 0.402
## valence_. 0.997
## energy_. -0.564 0.338
## acousticness_. 0.996
## instrumentalness_. -0.132
## liveness_.
## speechiness_.
##
## Factor1 Factor2
## SS loadings 1.364 1.291
## Proportion Var 0.195 0.184
## Cumulative Var 0.195 0.379
##
## Test of the hypothesis that 2 factors are sufficient.
## The chi square statistic is 71.61 on 8 degrees of freedom.
## The p-value is 2.35e-12
cbind(fp$loadings, fp$uniquenesses)
## Factor1 Factor2
## danceability_. -0.21837724 0.40158714 0.7910457
## valence_. -0.03670886 0.99682152 0.0050000
## energy_. -0.56391309 0.33781732 0.5678820
## acousticness_. 0.99646012 -0.04547383 0.0050000
## instrumentalness_. 0.03684830 -0.13197213 0.9812243
## liveness_. -0.04971375 0.01952639 0.9971506
## speechiness_. -0.02078942 0.04082132 0.9979045
factor_scores <- fp$scores
FA with two factors and Varimax rotation does not seem to capture the structure of the data.
Looking at the SS loadings, the two factors are fairly close, so each explains a similar amount of variance: Factor1 explains 19.5% and Factor2 18.4% of the variance, a cumulative total of only 37.9%. Moreover, the highly significant p-value (2.35e-12) means we reject the hypothesis that two factors are sufficient, i.e. the model does not fit.
So, to find a better number of factors, we examine the eigenvalues of the correlation matrix: each eigenvalue corresponds to a factor, and its value indicates the amount of variance explained by that factor.
eigenvalues <- eigen(cor(variables_fa))$values
print(eigenvalues)
## [1] 1.9841316 1.2251199 1.0371337 0.9459922 0.8755260 0.6171924 0.3149041
plot(eigenvalues, type="b", main="Scree Plot", ylab="Eigenvalue", xlab="Number of Factors", col="blue", pch=19)
abline(h=1, col="red", lty=2)
According to the Kaiser criterion, one should retain all factors with an eigenvalue above 1. In addition, the ‘elbow’ in the scree plot typically indicates the appropriate number of factors.
(ref. Braeken, Johan & Assen, Marcel. (2016). An Empirical Kaiser Criterion. Psychological methods. 22. 10.1037/met0000074.)
For these reasons, and because three eigenvalues here exceed 1, we decided to increase the number of factors to 3.
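As a quick check, counting the eigenvalues above 1 (using the values printed above) confirms the choice:

```r
# Eigenvalues of the correlation matrix, as printed above
eigenvalues <- c(1.9841316, 1.2251199, 1.0371337, 0.9459922,
                 0.8755260, 0.6171924, 0.3149041)
sum(eigenvalues > 1)  # Kaiser criterion: 3 factors
```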
variables_fa <- spotify.df[, c("danceability_.", "valence_.", "energy_.", "acousticness_.",
                               "instrumentalness_.", "liveness_.", "speechiness_.")]
fp <- factanal(variables_fa, factors = 3, rotation="varimax", scores="regression")
fp
##
## Call:
## factanal(x = variables_fa, factors = 3, scores = "regression", rotation = "varimax")
##
## Uniquenesses:
## danceability_. valence_. energy_. acousticness_.
## 0.170 0.005 0.039 0.589
## instrumentalness_. liveness_. speechiness_.
## 0.980 0.972 0.954
##
## Loadings:
## Factor1 Factor2 Factor3
## danceability_. 0.196 0.356 0.815
## valence_. 0.124 0.989
## energy_. 0.944 0.247
## acousticness_. -0.625 -0.140
## instrumentalness_. -0.130
## liveness_. 0.109 -0.126
## speechiness_. 0.211
##
## Factor1 Factor2 Factor3
## SS loadings 1.348 1.184 0.758
## Proportion Var 0.193 0.169 0.108
## Cumulative Var 0.193 0.362 0.470
##
## Test of the hypothesis that 3 factors are sufficient.
## The chi square statistic is 8.73 on 3 degrees of freedom.
## The p-value is 0.0331
cbind(fp$loadings, fp$uniquenesses)
## Factor1 Factor2 Factor3
## danceability_. 0.195959585 0.356168337 0.81542257 0.16983003
## valence_. 0.124219807 0.988960032 0.03908437 0.00500000
## energy_. 0.944199722 0.246629144 -0.09168768 0.03925434
## acousticness_. -0.625343445 0.001224361 -0.13961214 0.58946712
## instrumentalness_. -0.013374139 -0.130496558 -0.05384769 0.97989280
## liveness_. 0.108589661 0.012885364 -0.12603894 0.97215325
## speechiness_. 0.006354728 0.032469842 0.21091653 0.95444644
factor_scores <- fp$scores
We can now see that with three factors we account for 47% of the total variance. Note that the chi-square test evaluates the null hypothesis that three factors are sufficient; with a p-value of 0.0331 we reject that hypothesis at the usual 0.05 level, so even the three-factor model does not fully reproduce the observed correlation structure, although its fit is improved relative to the previous model.
Moreover, with only 47% of the total variance explained, the solution might not describe a general pattern that applies to the full data.
We can assume that one reason for the low explained variance is that about half of the variables we used are not normally distributed. Looking at the histograms and boxplots of the variables (line 275), all of them except danceability, valence, and energy show high positive skewness and several outliers.
Since factor analysis relies on the covariance matrix and assumes approximate multivariate normality, the estimated covariance matrix can be distorted by these skewed variables, leading to inaccurate factor loadings and uniquenesses.
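The skewness claim can be made concrete by computing the sample skewness of each variable. A minimal base-R sketch; `skewness` is our own helper (using the simple third-moment estimator, not a bias-corrected one), and the commented `sapply` call assumes the column names of `spotify.df`:

```r
# Sample skewness: third standardized moment; clearly positive values
# indicate a long right tail (as for speechiness or liveness)
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  sum((x - m)^3) / (length(x) * sd(x)^3)
}

# Hypothetical usage on the audio-feature columns:
# sapply(spotify.df[, c("danceability_.", "liveness_.", "speechiness_.")], skewness)
```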
Next, let’s cluster the songs by number of streams and musical attributes, to find out whether the popularity and musical characteristics of the songs are related.
For clustering, and K-means clustering in particular, it is important to scale the data, since K-means uses Euclidean distance. So we will first scale the data.
Moreover, we will perform K-means clustering both on the Yeo-Johnson (YJ) transformed data (computed on line 190) and on the non-transformed data, to see whether the distribution of the data affects the result.
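As a quick illustration of what `scale()` does — it centers each column to mean 0 and rescales it to standard deviation 1, so that no single variable dominates the Euclidean distances used by K-means:

```r
# Standardize a toy vector; scale() returns a matrix with one column
x <- scale(c(10, 20, 30, 40))
mean(x)  # 0
sd(x)    # 1
```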
library(scales)
library(cluster)
library(factoextra)
library(heatmaply)
scaled_data <- as.data.frame(scale(spotify.df[,c("streams","in_spotify_playlists", "in_apple_playlists","in_deezer_playlists","danceability_.","valence_.","energy_.","acousticness_.","instrumentalness_.","liveness_.","speechiness_.")]))
scaled_data_YJ <- as.data.frame(scale(spotify.df[,c("streams_YJ","in_spotify_playlists_YJ", "in_apple_playlists_YJ","in_deezer_playlists_YJ","danceability_.","valence_.","energy_.","acousticness_.","instrumentalness_.","liveness_.","speechiness_.")]))
heatmaply(scaled_data)
heatmaply(scaled_data_YJ)
Looking at the heatmaps, the data set ‘scaled_data_YJ’, which contains the transformed variables, shows clearer clusters among the variables. From the heatmaps alone, a reasonable first guess would be K = 2 for the non-transformed data and K = 3 for the transformed data.
Now let’s determine the optimal number of clusters for scaled_data and scaled_data_YJ using the elbow and silhouette methods.
set.seed(123) # Setting seed for reproducibility
# Compute total within-cluster sum of squares for K = 1..15
# (fviz_nbclust below recomputes this internally)
wss <- (nrow(scaled_data)-1)*sum(apply(scaled_data,2,var))
for (i in 2:15) wss[i] <- sum(kmeans(scaled_data, centers=i)$withinss)
fviz_nbclust(scaled_data, kmeans, method = "wss") + geom_vline(xintercept = 3, linetype = 2)
fviz_nbclust(scaled_data, kmeans, method = "silhouette")
In the first graph, the vertical line shows at which point of the number
of clusters the within-cluster sum of squares (WSS) starts to decrease
more slowly. In the elbow method the elbow point shows a good balance
between the number of clusters and the compactness of the clustering. In
this case, we can see the suggested number of K is 3.
In the second graph, we see the average silhouette width, which measures how close each point is to the points in its own cluster compared with the points in neighbouring clusters. By this criterion, the suggested value for K is 2.
However, since we combined two different kinds of variables (popularity measures and audio features), K = 2 might not give the level of detail we want. Thus, considering the purpose of this analysis and the nature of the data, we will set K = 3.
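The within-cluster sum of squares underlying the elbow plot can be reproduced in base R. A minimal sketch on synthetic data standing in for `scaled_data` (only `stats::kmeans` is used):

```r
set.seed(123)
# Synthetic 100 x 2 standardized data set standing in for scaled_data
x <- scale(matrix(rnorm(200), ncol = 2))

# Total within-cluster sum of squares for K = 1..10; the "elbow" is
# where the successive decreases start to level off
wss <- sapply(1:10, function(k) kmeans(x, centers = k, nstart = 10)$tot.withinss)

# For K = 1 this equals the total sum of squares of the data
```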
Let’s now find the best K for scaled_data_YJ.
wss <- (nrow(scaled_data_YJ)-1)*sum(apply(scaled_data_YJ,2,var))
for (i in 2:15) wss[i] <- sum(kmeans(scaled_data_YJ, centers=i)$withinss)
fviz_nbclust(scaled_data_YJ, kmeans, method = "wss") + geom_vline(xintercept = 3, linetype = 2)
fviz_nbclust(scaled_data_YJ, kmeans, method = "silhouette")
In this case, the average silhouette width is larger than for scaled_data, but the recommended K is the same.
K-means clustering can be sensitive to the curse of dimensionality. Thus, before clustering, let’s look at a PCA of each data set for better interpretability and reduction of noise.
pca_result <- prcomp(scaled_data, center = TRUE)
plot(pca_result)
summary(pca_result)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.7701 1.4111 1.1019 1.01839 0.97587 0.92581 0.78859
## Proportion of Variance 0.2848 0.1810 0.1104 0.09428 0.08657 0.07792 0.05653
## Cumulative Proportion 0.2848 0.4658 0.5762 0.67050 0.75707 0.83500 0.89153
## PC8 PC9 PC10 PC11
## Standard deviation 0.75490 0.56063 0.44925 0.32733
## Proportion of Variance 0.05181 0.02857 0.01835 0.00974
## Cumulative Proportion 0.94334 0.97191 0.99026 1.00000
With scaled_data, the first 3 components explain about 58% of the variance.
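The cumulative proportions reported by `summary(prcomp(...))` can be computed directly from the component standard deviations stored in `$sdev`. A sketch using the built-in `USArrests` data as a stand-in for `scaled_data`; `cum_var` is our own helper:

```r
# Cumulative proportion of variance explained by the first q components
cum_var <- function(pca, q) {
  v <- pca$sdev^2          # component variances (eigenvalues)
  sum(v[1:q]) / sum(v)
}

pca <- prcomp(scale(USArrests))
cum_var(pca, 3)  # matches the "Cumulative Proportion" row for PC3
```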
pca_result <- prcomp(scaled_data_YJ, center = TRUE)
plot(pca_result)
summary(pca_result)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.8220 1.4124 1.1011 1.01833 0.97442 0.93021 0.78716
## Proportion of Variance 0.3018 0.1814 0.1102 0.09427 0.08632 0.07866 0.05633
## Cumulative Proportion 0.3018 0.4831 0.5934 0.68763 0.77394 0.85261 0.90893
## PC8 PC9 PC10 PC11
## Standard deviation 0.58185 0.55910 0.47030 0.35971
## Proportion of Variance 0.03078 0.02842 0.02011 0.01176
## Cumulative Proportion 0.93971 0.96813 0.98824 1.00000
With scaled_data_YJ, the first 3 components explain about 59% of the variance.
Now let’s do the clustering on each data.
k <- 3
kmeans_result <- kmeans(scaled_data, centers = k, nstart = 25)
print(kmeans_result)
## K-means clustering with 3 clusters of sizes 625, 100, 228
##
## Cluster means:
## streams in_spotify_playlists in_apple_playlists in_deezer_playlists
## 1 -0.3076466 -0.3121457 -0.2182398 -0.2583447
## 2 2.2545559 2.3721414 1.9453369 2.1210144
## 3 -0.1455107 -0.1847503 -0.2549729 -0.2220878
## danceability_. valence_. energy_. acousticness_. instrumentalness_.
## 1 0.3332213 0.26203427 0.3621928 -0.3578429 -0.1252494
## 2 -0.1776802 0.04679305 0.1535227 -0.2095590 -0.1701969
## 3 -0.8355055 -0.73881897 -1.0601876 1.0728409 0.4179849
## liveness_. speechiness_.
## 1 0.0669662 0.1289670
## 2 -0.1263936 -0.1584972
## 3 -0.1281338 -0.2840116
##
## Clustering vector:
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## 1 1 3 1 3 1 1 1 1 1 1 1 2 3 2 1 1 3 1 1
## [ ... assignments for the remaining 933 songs omitted ... ]
##
## Within cluster sum of squares by cluster:
## [1] 3622.788 1465.447 2075.497
## (between_SS / total_SS = 31.6 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
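The `(between_SS / total_SS = 31.6 %)` line is derived from components stored in the result object, which is handy when comparing runs numerically. A sketch on synthetic stand-in data:

```r
set.seed(123)
# Stand-in data: 100 standardized points in 3 dimensions
km <- kmeans(scale(matrix(rnorm(300), ncol = 3)), centers = 3, nstart = 25)

# Fraction of the total variance that lies between the cluster centers;
# kmeans reports this as "(between_SS / total_SS = ... %)"
ratio <- km$betweenss / km$totss

# The decomposition totss = betweenss + tot.withinss always holds
```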
spotify.df$cluster <- kmeans_result$cluster
pca_result <- prcomp(scaled_data)
spotify.df$pca1 <- pca_result$x[,1]
spotify.df$pca2 <- pca_result$x[,2]
ggplot(spotify.df, aes(x = pca1, y = pca2, color = as.factor(cluster))) +
geom_point() +
labs(title = "K-means Clustering with PCA") +
theme_minimal()
cluster_summary <- aggregate(scaled_data, by=list(cluster=spotify.df$cluster), FUN=mean)
cluster_summary
library(reshape2)
cluster_melted <- melt(cluster_summary, id.vars="cluster")
ggplot(cluster_melted, aes(x=factor(cluster), y=value, fill=variable)) +
geom_bar(stat="identity", position="dodge") +
theme_minimal() +
labs(y="Mean Value", x="Cluster", fill="Variable") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
The clusters in this plot appear fairly spread out, particularly the green points. However, there is still some overlap between clusters, especially between the green and red points.
kmeans_result_2 <- kmeans(scaled_data_YJ, centers = k, nstart = 25)
print(kmeans_result_2)
## K-means clustering with 3 clusters of sizes 169, 321, 463
##
## Cluster means:
## streams_YJ in_spotify_playlists_YJ in_apple_playlists_YJ
## 1 -0.09299754 -0.09229419 -0.3165127
## 2 0.95298382 1.01220571 0.9925568
## 3 -0.62676290 -0.66807844 -0.5726136
## in_deezer_playlists_YJ danceability_. valence_. energy_. acousticness_.
## 1 -0.1880920 -0.9705911 -0.74589602 -1.2624676 1.3074905
## 2 0.9815881 0.0295476 0.07888286 0.2261371 -0.2931817
## 3 -0.6118839 0.3337907 0.21757026 0.3040324 -0.2739839
## instrumentalness_. liveness_. speechiness_.
## 1 0.4916472 -0.17305348 -0.3152712
## 2 -0.0920912 -0.06393032 -0.1452227
## 3 -0.1156093 0.10748957 0.2157609
##
## Clustering vector:
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## 3 3 1 2 1 3 2 3 3 3 2 3 2 1 2 2 2 1 3 3
## [ ... assignments for the remaining 933 songs omitted ... ]
##
## Within cluster sum of squares by cluster:
## [1] 1928.710 2094.815 3347.592
## (between_SS / total_SS = 29.6 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
# Add the cluster assignments to your original data frame
spotify.df$cluster_2 <- kmeans_result_2$cluster
# Cluster visualization
pca_result_2 <- prcomp(scaled_data_YJ)
spotify.df$pca_2_1 <- pca_result_2$x[,1]
spotify.df$pca_2_2 <- pca_result_2$x[,2]
ggplot(spotify.df, aes(x = pca_2_1, y = pca_2_2, color = as.factor(cluster_2))) +
geom_point() +
labs(title = "K-means Clustering with PCA") +
theme_minimal()
The clusters here seem to be somewhat mixed, with no clear boundaries, especially between the green and blue points. This suggests that the Yeo-Johnson transformation with subsequent PCA did not result in entirely distinct groupings in the first two principal components. However, we can see that the data is less skewed.
cluster_summary_2 <- aggregate(scaled_data_YJ, by=list(cluster_2=spotify.df$cluster_2), FUN=mean)
cluster_summary_2
cluster_melted <- melt(cluster_summary_2, id.vars="cluster_2")
ggplot(cluster_melted, aes(x=factor(cluster_2), y=value, fill=variable)) +
geom_bar(stat="identity", position="dodge") +
theme_minimal() +
labs(y="Mean Value", x="Cluster", fill="Variable") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
# To look up songs on each cluster
library(dplyr)
get_top_tracks <- function(data, cluster_number, n = 5) {
data %>%
filter(cluster == cluster_number) %>%
slice_head(n = n) %>%
pull(track_name)
}
clusters <- unique(spotify.df$cluster)
for (cluster in clusters) {
top_tracks <- get_top_tracks(spotify.df, cluster)
cat('Cluster', cluster, ':\n')
print(top_tracks)
cat('\n\n')
}
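For reference, the same lookup can be done without dplyr; a base-R sketch (`top_by_cluster` is our own helper, assuming the columns of `spotify.df`):

```r
# Split the track names by cluster label and take the first n of each
top_by_cluster <- function(track_name, cluster, n = 5) {
  lapply(split(track_name, cluster), head, n = n)
}

# Hypothetical usage:
# top_by_cluster(spotify.df$track_name, spotify.df$cluster)
```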
## Cluster 1 :
## [1] "Seven (feat. Latto) (Explicit Ver.)" "LALA"
## [3] "Cruel Summer" "Sprinter"
## [5] "Ella Baila Sola"
##
##
## Cluster 3 :
## [1] "vampire"
## [2] "WHERE SHE GOES"
## [3] "Daylight"
## [4] "What Was I Made For? [From The Motion Picture \"Barbie\"]"
## [5] "I Wanna Be Yours"
##
##
## Cluster 2 :
## [1] "Flowers"
## [2] "As It Was"
## [3] "Sunflower - Spider-Man: Into the Spider-Verse"
## [4] "I'm Good (Blue)"
## [5] "Starboy"
Comparing the clusters obtained from the two data sets, the patterns appear similar across the clusters, but the mean values differ in scale, particularly for the streaming-related variables, since those are the variables we transformed.
scaled_data_YJ: Appears to have moderately low values for the streaming-related variables and positive mean values for danceability and valence, while having negative mean values for energy and speechiness.
scaled_data: Shows a similar pattern but with more pronounced mean values for the streaming-related variables, indicating that this cluster is characterized by fewer streams and playlist appearances while still maintaining a certain level of danceability and valence.
Thus, cluster 1 consists of songs that are not heavily streamed, are less intense and active (energy), and contain fewer spoken words (speechiness). We can assume that songs that are not widely streamed or featured in playlists can still possess a certain appeal due to their danceability and positive mood, despite being less intense and having fewer spoken words. As an example, think of genres like smooth jazz or acoustic pop, which might not top the streaming charts but are danceable in a laid-back way and generally have a positive tone.
scaled_data_YJ: This cluster has the highest mean values for the streaming-related variables and positive mean values for danceability, valence, and energy, which means that the songs in this cluster are popular across streaming platforms and tend to be more upbeat and energetic.
scaled_data: Again shows higher values for the streaming-related variables, but with a larger gap to the other clusters, reinforcing the suggestion that this cluster contains the most popular and engaging songs.
Thus, cluster 2 consists of songs that are played by a large number of users regardless of the platform, and that are upbeat and energetic.
Therefore, we can conclude that highly streamed songs tend to feel more positive and danceable, such as pop and EDM tracks.
scaled_data_YJ: Displays lower values for the streaming-related variables and particularly low (negative) values for danceability, energy, and valence, implying that songs in this cluster are less popular, less danceable, less energetic, and less positive in musical tone.
scaled_data: Similar to the first cluster, with low streaming performance and low mean values for danceability, energy, and valence.
As we can see from the scatter plot, clusters 1 and 3 are more similar to each other than to cluster 2. Cluster 3 therefore indicates much the same thing as cluster 1, but it consists of songs that are more acoustic, with fewer vocal parts and a stronger focus on instrumental music.
Songs with higher levels of danceability, energy, and valence tend to be more popular, attributed to their lively and compelling characteristics. These qualities make such songs ideal for diverse listening scenarios, ranging from parties and workout sessions to mainstream radio play. In contrast, songs with lower scores in these attributes often appeal to specialized markets or specific contexts, such as relaxed environments, introspective times, or thematic playlists designed to evoke certain moods.
In examining the results from the scaled data and from the Yeo-Johnson-transformed data (scaled_data_YJ), we find that a more evenly distributed dataset does not necessarily change the outcome of our clustering model. Nonetheless, it remains important to choose appropriate data-processing techniques, considering the data characteristics and the nature of the analytical problem at hand.
While we have corroborated our initial hypothesis, it is important to note that since PC1 and PC2 account for only about 50% of the total variance in both datasets, we cannot confidently assert that these plots represent the trends across the entire dataset.
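The similarity between the two cluster assignments (`cluster` and `cluster_2`) can be quantified with the Rand index — the fraction of point pairs on which two partitions agree. A base-R sketch; `rand_index` is our own helper:

```r
# Rand index between two partitions a and b (label-invariant:
# identical groupings score 1 even if the cluster numbers differ)
rand_index <- function(a, b) {
  same_a <- outer(a, a, "==")   # pairs grouped together in a
  same_b <- outer(b, b, "==")   # pairs grouped together in b
  agree  <- same_a == same_b    # pairs treated the same way by both
  sum(agree[upper.tri(agree)]) / choose(length(a), 2)
}

# Hypothetical usage:
# rand_index(spotify.df$cluster, spotify.df$cluster_2)
```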
In summary, a more broadly distributed dataset may enhance our understanding of the overall trends, although the extent of this benefit is contingent upon the specific nuances of the data in question.